
Fix legacy endpoint backwards compatibility for _use_legacy_endpoint feature flag #45727

Open
slister1001 wants to merge 16 commits into Azure:main from slister1001:fix/legacy-endpoint-backwards-compat

Conversation

@slister1001
Member

Fix 7 bugs that prevented the _use_legacy_endpoint=True flag from being fully backwards compatible with the pre-sync-migration behavior:

  1. Add bidirectional metric name mapping in evaluate_with_rai_service_sync() and evaluate_with_rai_service_sync_multimodal(): legacy endpoint gets hate_fairness, sync endpoint gets hate_unfairness, regardless of caller input.

  2. Skip _parse_eval_result() for legacy endpoint in _evaluate_query_response(): legacy returns a pre-parsed dict from parse_response(), return directly.

  3. Restore whole-conversation evaluation in _evaluate_conversation() when legacy endpoint: send all messages in a single call (pre-migration behavior) instead of per-turn evaluation.

  4. Remove dead effective_metric_name variable in _evaluation_processor.py: metric normalization is now handled at the routing layer.

  5. Pass evaluator_name in red team evaluation processor for telemetry.

  6. Add use_legacy_endpoint parameter to Foundry RAIServiceScorer and forward it to evaluate_with_rai_service_sync(). Remove redundant manual metric name mapping (now handled by routing layer).

  7. Update metric_mapping.py comment to document the routing layer approach.

Tests:

  • 9 new unit tests in test_legacy_endpoint_compat.py covering query/response, conversation, metric enum, and _parse_eval_result paths
  • 4 new unit tests in test_content_safety_rai_script.py covering routing, metric name mapping for both endpoints
  • 5 new e2e tests in test_builtin_evaluators.py covering all content safety evaluators with legacy endpoint, key format parity, and conversation mode
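
The routing-layer normalization in item 1 can be sketched roughly as follows. The dict name `_SYNC_TO_LEGACY_METRIC_NAMES` is the one used later in this PR, but `normalize_metric` and the standalone shape here are illustrative, not the SDK's actual code:

```python
# Sketch only: one entry per metric whose legacy annotation name differs.
_SYNC_TO_LEGACY_METRIC_NAMES = {
    "hate_unfairness": "hate_fairness",  # sync name -> legacy annotation name
}
# Inverted once at module level so it is not rebuilt on every call.
_LEGACY_TO_SYNC_METRIC_NAMES = {v: k for k, v in _SYNC_TO_LEGACY_METRIC_NAMES.items()}

def normalize_metric(metric_name: str, use_legacy_endpoint: bool) -> str:
    """Return the name the chosen endpoint expects, regardless of caller input."""
    if use_legacy_endpoint:
        return _SYNC_TO_LEGACY_METRIC_NAMES.get(metric_name, metric_name)
    return _LEGACY_TO_SYNC_METRIC_NAMES.get(metric_name, metric_name)
```

Metrics with no entry (e.g. violence) pass through unchanged in both directions.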

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Mar 16, 2026
…feature flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 force-pushed the fix/legacy-endpoint-backwards-compat branch from 2c18b75 to fa45fcc on March 16, 2026 at 22:20
slister1001 and others added 2 commits March 17, 2026 09:19
The 5 new legacy endpoint e2e tests require test proxy recordings
that don't exist yet. Mark them with pytest.mark.skip so CI passes
in playback mode. The tests work correctly in live mode (verified
locally).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 force-pushed the fix/legacy-endpoint-backwards-compat branch from 01cee1d to 876e9e5 on March 17, 2026 at 13:56
slister1001 and others added 2 commits March 17, 2026 10:28
- Record 5 new legacy endpoint e2e tests (pushed to azure-sdk-assets)
- Fix PROXY_URL callable check in conftest.py for local recording compat
- Fix missing request.getfixturevalue() in test_self_harm_evaluator

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These files import azure.ai.evaluation.red_team which requires pyrit,
causing ImportError in CI environments without the redteam extra.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 force-pushed the fix/legacy-endpoint-backwards-compat branch from b02430f to f849593 on March 17, 2026 at 17:14
slister1001 and others added 2 commits March 17, 2026 13:50
…ests

- Map groundedness -> generic_groundedness for legacy annotation endpoint
- Set metric_display_name to preserve 'groundedness' output keys
- Add e2e tests for ALL evaluators with _use_legacy_endpoint=True:
  GroundednessPro, ProtectedMaterial, CodeVulnerability, IndirectAttack,
  UngroundedAttributes, ECI

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace if/elif chains with _SYNC_TO_LEGACY_METRIC_NAMES dict used
bidirectionally. Adding new metric mappings is now a one-line change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 force-pushed the fix/legacy-endpoint-backwards-compat branch from fadf4a1 to 2f144c7 on March 17, 2026 at 18:51
slister1001 and others added 4 commits March 17, 2026 14:56
The legacy annotation API returns results under 'xpia' and 'eci' keys,
not 'indirect_attack' and 'election_critical_information'. Without this
mapping, parse_response cannot find the metric key in the response and
returns empty dict.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The legacy annotation API returns XPIA results under 'xpia' and ECI
under 'eci', but parse_response looked for 'indirect_attack' and
'election_critical_information'. Add _SYNC_TO_LEGACY_RESPONSE_KEYS
fallback lookup in both parse_response and _parse_content_harm_response.

Split mapping into two dicts:
- _SYNC_TO_LEGACY_METRIC_NAMES: metrics where the API request name differs
- _SYNC_TO_LEGACY_RESPONSE_KEYS: superset including response key differences

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ECIEvaluator uses _InternalEvaluationMetrics.ECI = 'election_critical_information'
as metric_display_name, so output keys are election_critical_information_label,
not eci_label.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 marked this pull request as ready for review March 17, 2026 22:12
@slister1001 slister1001 requested a review from a team as a code owner March 17, 2026 22:12
Copilot AI review requested due to automatic review settings March 17, 2026 22:12
Contributor

Copilot AI left a comment


Pull request overview

This PR restores backward compatibility for the _use_legacy_endpoint=True feature flag across the Azure AI Evaluation RAI-service-backed evaluators and red team scoring, primarily by normalizing metric names between the legacy annotation endpoint and the newer sync_evals endpoint, and by aligning legacy conversation behavior with pre-sync-migration semantics.

Changes:

  • Add bidirectional metric/request/response-key normalization in rai_service.py and update evaluator routing to skip double-parsing for legacy responses.
  • Restore legacy whole-conversation evaluation behavior (single-call) and extend red team scorer/processor to pass endpoint choice and evaluator name for telemetry.
  • Add/extend unit + e2e tests for legacy endpoint parity (query/response, conversation, and metric mapping).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Files changed:

  • sdk/evaluation/azure-ai-evaluation/tests/unittests/test_legacy_endpoint_compat.py: Adds focused unit tests for legacy endpoint behavior and parsing paths.
  • sdk/evaluation/azure-ai-evaluation/tests/unittests/test_content_safety_rai_script.py: Adds routing/mapping unit tests for sync↔legacy metric normalization.
  • sdk/evaluation/azure-ai-evaluation/tests/e2etests/test_builtin_evaluators.py: Adds e2e coverage for legacy endpoint across multiple built-in evaluators and output key parity.
  • sdk/evaluation/azure-ai-evaluation/tests/conftest.py: Makes proxy URL handling robust when PROXY_URL is not callable.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_utils/metric_mapping.py: Updates documentation comment to reflect routing-layer normalization.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_rai_scorer.py: Adds use_legacy_endpoint plumb-through to scoring calls and removes local metric remapping.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_evaluation_processor.py: Removes dead metric mapping logic and adds evaluator_name for telemetry.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py: Adjusts conversation/query-response routing for legacy vs sync endpoint behavior.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_common/rai_service.py: Introduces centralized sync↔legacy metric and response-key mappings; adds endpoint-based normalization.
  • sdk/evaluation/azure-ai-evaluation/assets.json: Updates assets tag reference.
Comments suppressed due to low confidence (1)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_common/rai_service.py:1333

  • evaluate_with_rai_service_sync_multimodal maps some sync metric names to legacy names when use_legacy_endpoint=True, then calls evaluate_with_rai_service_multimodal, which ultimately calls parse_response(..., metric_name) with no metric_display_name. That means the returned dict keys will be based on the legacy metric name (e.g., hate_fairness_*), not the canonical sync name (hate_unfairness_*), breaking key-parity/back-compat. Mirror the metric_display_name approach used in evaluate_with_rai_service_sync (e.g., plumb a display name through the multimodal legacy path or post-process keys before returning).
    # Normalize metric name (same mapping as evaluate_with_rai_service_sync)
    metric_name_str = metric_name.value if hasattr(metric_name, "value") else metric_name
    if use_legacy_endpoint:
        legacy_name = _SYNC_TO_LEGACY_METRIC_NAMES.get(metric_name_str)
        if legacy_name:
            metric_name = legacy_name
    else:
        _legacy_to_sync = {v: k for k, v in _SYNC_TO_LEGACY_METRIC_NAMES.items()}
        sync_name = _legacy_to_sync.get(metric_name_str)
        if sync_name:
            metric_name = sync_name

    # Route to legacy endpoint if requested
    if use_legacy_endpoint:
        return await evaluate_with_rai_service_multimodal(
            messages=messages,
            metric_name=metric_name,
            project_scope=project_scope,
            credential=credential,
        )



- Define _LEGACY_TO_SYNC_METRIC_NAMES at module level (avoid rebuilding on every call)
- Fix assertion in test to match string type (not enum)
- Remove unused @patch decorator and cred_mock parameter
- Delete test_legacy_endpoint_compat.py entirely
- Fix effective_metric_name NameError in _evaluation_processor.py lookup_names
- Route legacy conversation through sync wrapper for metric normalization
- Remove unused evaluate_with_rai_service_multimodal import

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member

@nagkumar91 nagkumar91 left a comment


Code Review

🔴 Legacy conversation path returns wrong format (Critical)

_evaluate_conversation() legacy path (lines 140-147 of _base_rai_svc_eval.py) returns the raw result from evaluate_with_rai_service_sync_multimodal() directly — a flat dict like {"violence": 0, "violence_score": 0, ...}. But conversation evaluation is expected to return an evaluation_per_turn structure built by _aggregate_results(). The legacy path bypasses _aggregate_results() entirely.

The e2e test test_content_safety_evaluator_conversation_with_legacy_endpoint asserts:

assert "evaluation_per_turn" in score
assert len(score["evaluation_per_turn"]["violence"]) == 2

This will fail since the flat dict from the legacy endpoint has no evaluation_per_turn key. Either the legacy path needs to wrap its result through _aggregate_results(), or the test expectations need to match the actual legacy output format.
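
A rough sketch of the shape being asked for here (assumed structure only; the SDK's `_aggregate_results` is more involved than this fold):

```python
def aggregate_results(per_call_results):
    """Fold flat per-call result dicts into an evaluation_per_turn structure."""
    per_turn = {}
    for result in per_call_results:
        for key, value in result.items():
            per_turn.setdefault(key, []).append(value)
    return {"evaluation_per_turn": per_turn}

# Legacy path: one whole-conversation call, so each list has one entry.
score = aggregate_results([{"violence": 0, "violence_score": 0}])
```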


🟡 Missing metric_display_name in multimodal path (High)

evaluate_with_rai_service_sync() correctly sets metric_display_name = metric_display_name or metric_name_str before remapping to the legacy name, so output keys remain hate_unfairness_*. However, evaluate_with_rai_service_sync_multimodal() does NOT set metric_display_name:

# evaluate_with_rai_service_sync (correct):
if legacy_name:
    metric_display_name = metric_display_name or metric_name_str  # ✅ preserves original name
    metric_name = legacy_name

# evaluate_with_rai_service_sync_multimodal (missing):
if legacy_name:
    metric_name = legacy_name  # ❌ no metric_display_name preservation

This means legacy multimodal results will produce keys under the legacy name (hate_fairness_*) instead of the expected name (hate_unfairness_*).

Fix: Add metric_display_name = metric_display_name or metric_name_str in the multimodal function before metric_name = legacy_name.


🟡 Duplicated normalization logic (Medium)

The 10-line metric normalization block is copy-pasted in both evaluate_with_rai_service_sync() and evaluate_with_rai_service_sync_multimodal(). This duplication already caused bug #2 above (missing metric_display_name in one copy).

Fix: Extract to a helper:

def _normalize_metric_for_endpoint(metric_name, use_legacy_endpoint, metric_display_name=None):
    metric_name_str = metric_name.value if hasattr(metric_name, "value") else metric_name
    if use_legacy_endpoint:
        legacy_name = _SYNC_TO_LEGACY_METRIC_NAMES.get(metric_name_str)
        if legacy_name:
            return legacy_name, (metric_display_name or metric_name_str)
    else:
        sync_name = _LEGACY_TO_SYNC_METRIC_NAMES.get(metric_name_str)
        if sync_name:
            return sync_name, metric_display_name
    return metric_name, metric_display_name

🟡 Response key fallback applies unconditionally (Medium)

The legacy response key fallback in parse_response() and _parse_content_harm_response() applies even when NOT using the legacy endpoint:

if metric_name not in batch_response[0]:
    legacy_key = _SYNC_TO_LEGACY_RESPONSE_KEYS.get(...)
    if legacy_key and legacy_key in batch_response[0]:
        response_key = legacy_key  # runs even for sync endpoint

If the sync endpoint returns an unexpected format, this silently falls back to checking legacy keys instead of failing fast. This could mask API contract violations.

Fix: Thread use_legacy_endpoint through to these functions and only apply the fallback when use_legacy_endpoint=True.
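
The guarded fallback suggested here could look roughly like this (hypothetical standalone shape; the xpia/eci pairs are the ones named in the commit messages above, and `resolve_response_key` is illustrative, not the SDK's function):

```python
# Superset of name differences that also show up as response keys.
_SYNC_TO_LEGACY_RESPONSE_KEYS = {
    "indirect_attack": "xpia",
    "election_critical_information": "eci",
}

def resolve_response_key(metric_name, response, use_legacy_endpoint):
    """Pick the key to read from the service response, failing fast for sync."""
    if metric_name in response:
        return metric_name
    if use_legacy_endpoint:  # only the legacy endpoint uses the old key names
        legacy_key = _SYNC_TO_LEGACY_RESPONSE_KEYS.get(metric_name)
        if legacy_key and legacy_key in response:
            return legacy_key
    raise KeyError(f"metric key {metric_name!r} not found in response")
```

With the flag threaded through, an unexpected sync-endpoint response raises instead of silently matching a legacy key.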

- Extract _normalize_metric_for_endpoint() helper (fixes duplication + ensures
  metric_display_name is set in both sync and multimodal paths)
- Fix legacy conversation path to produce evaluation_per_turn structure by
  wrapping result through _aggregate_results()
- Add comments clarifying response key fallback is inherently legacy-only
  (parse_response is only called from legacy endpoint functions)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member

@nagkumar91 nagkumar91 left a comment


Thanks for addressing the prior review. The helper extraction and _aggregate_results wrapping look good. Two remaining issues:

🔴 Conversation legacy test asserts wrong per-turn count (will fail)

test_content_safety_evaluator_conversation_with_legacy_endpoint asserts:

assert len(score["evaluation_per_turn"]["violence"]) == 2
assert len(score["evaluation_per_turn"]["violence_score"]) == 2

But the legacy path sends the entire conversation in a single call, then wraps the single result via _aggregate_results([result]). That produces evaluation_per_turn lists of length 1 (one aggregated result), not 2. The sync path evaluates per-turn (2 turns → length 2), but the legacy path intentionally evaluates as a single batch.

Fix: Either change assertions to == 1 (matching actual legacy behavior), or evaluate per-turn in legacy mode too (but that would contradict the stated goal of matching pre-migration behavior).


🟡 metric_display_name still not propagated in multimodal legacy path (High)

evaluate_with_rai_service_sync_multimodal discards the display name:

metric_name, _ = _normalize_metric_for_endpoint(metric_name, use_legacy_endpoint)

Downstream, evaluate_with_rai_service_multimodal() calls parse_response(annotation_response, metric_name) without metric_display_name, so output keys use the legacy name (hate_fairness_score, hate_fairness_reason) instead of the expected sync name (hate_unfairness_score, hate_unfairness_reason).

This affects HateUnfairnessEvaluator and GroundednessProEvaluator in conversation mode with legacy endpoint. It's untested because the only conversation+legacy test uses ViolenceEvaluator (no remapping needed).

Fix: Thread metric_display_name through to evaluate_with_rai_service_multimodal() and pass it to parse_response(). This requires adding the parameter to the multimodal function signature.

- Fix conversation legacy test: assert per-turn length == 1 (not 2), since
  legacy sends entire conversation as single call
- Thread metric_display_name through evaluate_with_rai_service_multimodal
  so legacy multimodal results use correct output key names (e.g.
  hate_unfairness_* not hate_fairness_*)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member

@nagkumar91 nagkumar91 left a comment


All four issues from previous reviews are resolved:

  1. Duplicated normalization → Extracted to _normalize_metric_for_endpoint() helper ✅
  2. metric_display_name in multimodal → Now threaded through the full chain: _normalize_metric_for_endpoint → evaluate_with_rai_service_sync_multimodal → evaluate_with_rai_service_multimodal (new param) → parse_response ✅
  3. Conversation legacy format → Wrapped via _aggregate_results([result]); test correctly asserts == 1 per-turn entries with explanatory comment ✅
  4. Response key fallback scope → Comments clarify parse_response is inherently legacy-only ✅

LGTM.

…_to_eval_input

The parent class _convert_kwargs_to_eval_input decomposes text conversations
into per-turn {query, response} pairs before _do_eval is called, routing to
_evaluate_query_response instead of _evaluate_conversation. This bypasses the
legacy single-call logic entirely.

Override _convert_kwargs_to_eval_input in RaiServiceEvaluatorBase to pass
conversations through intact when _use_legacy_endpoint=True, so
_evaluate_conversation is reached and sends all messages in one API call.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
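
The override this commit describes can be sketched like so. The class and method names follow the thread, but the real base-class signature and per-turn decomposition logic differ; this stand-in only illustrates the branching:

```python
class RaiServiceEvaluatorBaseSketch:
    """Illustrative stand-in for RaiServiceEvaluatorBase's input conversion."""

    def __init__(self, use_legacy_endpoint=False):
        self._use_legacy_endpoint = use_legacy_endpoint

    def _convert_kwargs_to_eval_input(self, conversation):
        if self._use_legacy_endpoint:
            # Pass the conversation through intact: one whole-conversation call.
            return [conversation]
        # Sync path: decompose into per-turn {query, response} pairs
        # (assumes alternating user/assistant messages, for illustration).
        messages = conversation["messages"]
        return [
            {"query": q["content"], "response": r["content"]}
            for q, r in zip(messages[::2], messages[1::2])
        ]
```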
Move validate_conversation() call after the legacy endpoint check since it
requires multimodal (image) content. Text conversations routed through the
legacy path don't need this validation.

Re-recorded test_content_safety_evaluator_conversation_with_legacy_endpoint
in live mode and pushed new recordings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>